A Bit of Information Theory, and the Data
Augmentation Algorithm Converges 

Yaming Yu, Member, IEEE



Abstract — The data augmentation (DA) algorithm is a simple 
and powerful tool in statistical computing. In this note basic 
information theory is used to prove a nontrivial convergence 
theorem for the DA algorithm. 

Index Terms — Gibbs sampling, information geometry, I- 
projection, Kullback-Leibler divergence, Markov chain Monte 
Carlo, Pinsker's inequality, relative entropy, reverse I-projection, 
total variation 



I. Background 

In many statistical problems we would like to sample
from a probability density $\pi(x, y)$, e.g., the joint posterior
of all parameters and latent variables in a Bayesian model.
When $\pi(x, y)$ is complicated, direct simulation may be impractical;
however, if the conditional densities $\pi_{X|Y}(x|y)$
and $\pi_{Y|X}(y|x)$ are tractable, the following algorithm is an
intuitively appealing alternative. Draw $(X, Y)$ from an initial
density $p^{(0)}(x, y)$, and then alternatingly replace $X$ by a
conditional draw given $Y$ according to $\pi_{X|Y}(x|y)$, and $Y$ by
a conditional draw given $X$ according to $\pi_{Y|X}(y|x)$; this is
a crude description of the data augmentation (DA) algorithm 
of Tanner and Wong [18] (see also [15], [20] and [22]), a 
powerful and widely used method in statistical computing. 
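
As an illustration of the alternating draws described above, here is a minimal sketch on a discrete toy problem. The 2x2 table `PI`, the function names, and the seed are all assumptions made for this example; they do not come from Tanner and Wong [18].

```python
import random

# Hypothetical 2x2 target pi(x, y) with all entries positive (illustrative values).
PI = [[0.1, 0.2],
      [0.3, 0.4]]

def draw_x_given_y(y, rng):
    """Sample X from pi_{X|Y}(. | y): column y of PI, used as unnormalized weights."""
    col = [PI[x][y] for x in range(2)]
    return rng.choices(range(2), weights=col)[0]

def draw_y_given_x(x, rng):
    """Sample Y from pi_{Y|X}(. | x): row x of PI, used as unnormalized weights."""
    return rng.choices(range(2), weights=PI[x])[0]

def da_chain(n_iter, x0=0, y0=0, seed=0):
    """Alternate the two conditional draws, recording the pair after each full sweep."""
    rng = random.Random(seed)
    x, y = x0, y0
    samples = []
    for _ in range(n_iter):
        x = draw_x_given_y(y, rng)   # replace X given the current Y
        y = draw_y_given_x(x, rng)   # replace Y given the new X
        samples.append((x, y))
    return samples

def empirical_joint(samples):
    """Relative frequencies of the four (x, y) cells."""
    counts = [[0, 0], [0, 0]]
    for x, y in samples:
        counts[x][y] += 1
    n = len(samples)
    return [[c / n for c in row] for row in counts]
```

For a long run, `empirical_joint(da_chain(20000))` should be close to `PI`, although the sketch by itself says nothing about how fast that happens.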

It is not immediately obvious that iterates of the DA
algorithm should approach the target $\pi(x, y)$. To show convergence,
one usually appeals to Markov chain theory (Tierney
[19]), which says that (roughly) if a Markov chain is irre- 
ducible and aperiodic, and possesses a stationary distribution, 
then it converges to that distribution. Such results are often 
stated in terms of the total variation distance, defined for two 
densities $p$ and $q$ as

$$V(p, q) = \int |p - q|.$$

Because iterates of the DA algorithm form a Markov chain, 
they converge in total variation under some regularity condi- 
tions. 

Total variation, of course, is not the only discrepancy 
measure. There is actually another discrepancy measure that 
is natural for the problem, yet rarely explored. Recall that the 
relative entropy, or Kullback-Leibler divergence, between two 
densities p and q is defined as 



$$D(p|q) = \int p \log(p/q).$$



It is related to $V(p, q)$ via the well-known Pinsker's inequality

$$D(p|q) \ge \frac{1}{2} V^2(p, q),$$

so that for a sequence of densities $p_t$, $t = 0, 1, \ldots$,
$\lim_{t\to\infty} D(p_t|p_\infty) = 0$ implies $\lim_{t\to\infty} V(p_t, p_\infty) = 0$.
Other useful properties of relative entropy can be found in 
Cover and Thomas [3]. 
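
For discrete distributions both discrepancy measures are easy to compute, which gives a quick numerical check of Pinsker's inequality. The sketch below assumes probability vectors stored as Python lists and takes $V$ to be the $L_1$ distance $\sum_i |p_i - q_i|$; the function names are illustrative.

```python
import math

def total_variation(p, q):
    """V(p, q) = sum_i |p_i - q_i| (L1 convention, so V ranges over [0, 2])."""
    return sum(abs(a - b) for a, b in zip(p, q))

def relative_entropy(p, q):
    """D(p|q) = sum_i p_i log(p_i / q_i), with 0 log 0 = 0.

    Returns infinity when p puts mass where q has none.
    """
    d = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            d += a * math.log(a / b)
    return d
```

With `p = [0.5, 0.5]` and `q = [0.9, 0.1]`, `total_variation` gives 0.8 and `relative_entropy` gives roughly 0.51, which exceeds the Pinsker lower bound of 0.32.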

It is the purpose of this note to analyze the DA algorithm 
in terms of relative entropy and present a short proof of a 
convergence result (Theorem 2.1) using simple information
theoretic techniques. 

II. Main Result 

Let $\mu \times \nu$ be a product measure on a joint measurable space
$(\mathcal{X} \times \mathcal{Y}, \mathcal{F} \times \mathcal{G})$. Suppose the target density $\pi(x, y)$ with respect
to $\mu \times \nu$ satisfies $\pi(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ (in
statistical applications often $\mathcal{X}$ and $\mathcal{Y}$ are subsets of Euclidean
spaces and each of $\mu$ and $\nu$ is either Lebesgue measure
or the counting measure). Formally, given an initial density
$p^{(0)}(x, y)$, the DA algorithm generates a sequence of densities



$p^{(t)}(x, y)$, $t \ge 0$, (where $p_X^{(t)}(x) = \int_{\mathcal{Y}} p^{(t)}(x, y)\, d\nu(y)$, for
example)

$$p^{(t+1)}(x, y) = \begin{cases} p_X^{(t)}(x)\, \pi_{Y|X}(y|x), & t \text{ odd}; \\ p_Y^{(t)}(y)\, \pi_{X|Y}(x|y), & t \text{ even}. \end{cases} \qquad (1)$$
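
When $\mathcal{X}$ and $\mathcal{Y}$ are finite and $\mu$, $\nu$ are counting measures, recursion (1) can be iterated exactly on probability tables. The following sketch does this and tracks $D(p^{(t)}|\pi)$; the 2x3 target `PI` and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical strictly positive target on a 2x3 grid (illustrative values).
PI = [[0.10, 0.20, 0.05],
      [0.25, 0.10, 0.30]]

def marginals(p):
    """Return the X-marginal (row sums) and the Y-marginal (column sums)."""
    px = [sum(row) for row in p]
    py = [sum(p[x][y] for x in range(len(p))) for y in range(len(p[0]))]
    return px, py

def kl(p, q):
    """D(p|q) for joint tables; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b)
               for rp, rq in zip(p, q) for a, b in zip(rp, rq) if a > 0)

def da_step(p, t):
    """Return p^(t+1) from p^(t) following recursion (1)."""
    px, py = marginals(p)
    pix, piy = marginals(PI)
    nx, ny = len(PI), len(PI[0])
    if t % 2 == 1:
        # t odd: keep the X-marginal of p, plug in pi_{Y|X}
        return [[px[x] * PI[x][y] / pix[x] for y in range(ny)] for x in range(nx)]
    # t even: keep the Y-marginal of p, plug in pi_{X|Y}
    return [[py[y] * PI[x][y] / piy[y] for y in range(ny)] for x in range(nx)]

def run(p0, n_steps):
    """Return [D(p^(0)|pi), ..., D(p^(n_steps)|pi)] and the final table."""
    p, ds = p0, []
    for t in range(n_steps):
        ds.append(kl(p, PI))
        p = da_step(p, t)
    ds.append(kl(p, PI))
    return ds, p
```

Starting from any strictly positive table, the recorded divergences should be non-increasing and shrink toward zero, up to floating-point noise.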



Theorem 2.1: If $\pi(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
and $D(p^{(0)}|\pi) < \infty$, then iterates of the DA algorithm (1)
converge in relative entropy, i.e.,

$$\lim_{t\to\infty} D(p^{(t)}|\pi) = 0,$$

and $\lim_{t\to\infty} V(p^{(t)}, \pi) = 0$ necessarily.

The condition $\pi(x, y) > 0$, $(x, y) \in \mathcal{X} \times \mathcal{Y}$, can be
weakened, and the result can be generalized to the Gibbs
sampler ([11], [10]); see Yu [21]. Note that the conditions of
Theorem 2.1 are already weaker than those of Schervish and
Carlin [17], for example (see also Liu et al. [13]), although
Theorem 2.1 does not give a qualitative rate of convergence.
As a general comment, the approach taken here complements 
the more traditional L2 approach (Amit [1]) that studies 
the Gibbs sampler in the Hilbert space of square integrable 
functions. 

Section III provides a short, self-contained proof of Theorem 2.1.
The main tools (Lemmas 3.1-3.3) exploit the
information geometry of the DA algorithm. Although relative 
entropy does not define a metric, it behaves like squared 
Euclidean distance. See Csiszar [4], Csiszar and Shields [6], 
and Csiszar and Matus [5] for the notions of I-projection and 
reverse I-projection that explore such properties in broader 
contexts. 



III. Proof of Theorem 2.1

In this section let $p^{(t)}$ be a sequence of densities generated
according to (1) with $D(p^{(0)}|\pi) < \infty$. Lemma 3.1 captures
the intuition that each iteration is a projection (more precisely, 
a reverse I-projection) onto the set of densities with a given 
conditional. The proof is simple and hence omitted. 

Lemma 3.1: For all $t \ge 0$,

$$D(p^{(t)}|\pi) = D(p^{(t)}|p^{(t+1)}) + D(p^{(t+1)}|\pi).$$
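
For readers who want the omitted argument for the Pythagorean-type identity $D(p^{(t)}|\pi) = D(p^{(t)}|p^{(t+1)}) + D(p^{(t+1)}|\pi)$, one possible derivation is sketched here for even $t$; the odd case swaps the roles of $x$ and $y$. It assumes $D(p^{(t)}|\pi) < \infty$, so that the subtraction in the first line is legitimate.

```latex
% Case t even, so p^{(t+1)}(x,y) = p_Y^{(t)}(y) \pi_{X|Y}(x|y) and hence
% p^{(t+1)}/\pi = p_Y^{(t)}/\pi_Y depends on y only.
\begin{align*}
D(p^{(t)}|\pi) - D(p^{(t)}|p^{(t+1)})
  &= \int p^{(t)} \log\frac{p^{(t+1)}}{\pi} \, d\mu \, d\nu \\
  &= \int p_Y^{(t)}(y) \log\frac{p_Y^{(t)}(y)}{\pi_Y(y)} \, d\nu(y) \\
  &= D(p_Y^{(t)}|\pi_Y) \\
  &= D(p^{(t+1)}|\pi).
\end{align*}
% Last step: p^{(t+1)} and \pi share the conditional \pi_{X|Y} and
% p_Y^{(t+1)} = p_Y^{(t)}, so only the Y-marginals contribute.
```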



According to Lemma 3.1, $D(p^{(t)}|\pi)$ can only decrease in $t$
(this holds for Markov chains in general). However, it does not
imply $D(p^{(t)}|\pi) \downarrow 0$. To prove the theorem we need further
analysis. 

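Lemma 3.2: Let $t \ge 1$ and $n \ge 1$. If $n$ is even then

$$D(p^{(t)}|p^{(t+n)}) \le D(p^{(t)}|p^{(t+n-1)}); \qquad (2)$$

if $n$ is odd then

$$D(p^{(t)}|p^{(t+n)}) = D(p^{(t)}|p^{(t+1)}) + D(p^{(t+1)}|p^{(t+n)}). \qquad (3)$$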

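Proof: To prove (2), without loss of generality assume
$t$ is odd. Since $n$ is even, $p^{(t)}$ and $p^{(t+n)}$ have the same
conditional $p^{(t)}_{X|Y} = p^{(t+n)}_{X|Y} = \pi_{X|Y}$, whereas $p_Y^{(t+n)} = p_Y^{(t+n-1)}$. Hence

$$D(p^{(t)}|p^{(t+n)}) = D(p_Y^{(t)}|p_Y^{(t+n)}) = D(p_Y^{(t)}|p_Y^{(t+n-1)}) \le D(p^{(t)}|p^{(t+n-1)}),$$

the last inequality being a basic property of relative entropy
(Cover and Thomas [3]). The proof of (3), omitted, is the same
as that of Lemma 3.1. ■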
Lemma 3.3: For all $t \ge 1$ and $n \ge 0$ we have

$$D(p^{(t)}|p^{(t+n)}) \le D(p^{(t)}|\pi) - D(p^{(t+n)}|\pi). \qquad (4)$$
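
Inequality (4), $D(p^{(t)}|p^{(t+n)}) \le D(p^{(t)}|\pi) - D(p^{(t+n)}|\pi)$, can be sanity-checked numerically on a small discrete example before reading the proof. The sketch below assumes a 2x2 target `PI` and initial table `P0` with illustrative values, iterates recursion (1), and reports the largest observed value of left side minus right side. It is only a spot check on one example, not a substitute for the induction argument.

```python
import math

# Hypothetical strictly positive 2x2 target and initial density (illustrative values).
PI = [[0.1, 0.2], [0.3, 0.4]]
P0 = [[0.7, 0.1], [0.1, 0.1]]

def kl(p, q):
    """D(p|q) for 2x2 tables; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b)
               for rp, rq in zip(p, q) for a, b in zip(rp, rq) if a > 0)

def col(m, y):
    """Column sum, i.e. the Y-marginal of table m at y."""
    return m[0][y] + m[1][y]

def da_step(p, t):
    """One step of recursion (1): t odd keeps p_X, t even keeps p_Y."""
    if t % 2 == 1:
        return [[sum(p[x]) * PI[x][y] / sum(PI[x]) for y in range(2)]
                for x in range(2)]
    return [[col(p, y) * PI[x][y] / col(PI, y) for y in range(2)]
            for x in range(2)]

def iterates(t_max):
    """Return [p^(0), p^(1), ..., p^(t_max)]."""
    ps = [P0]
    for t in range(t_max):
        ps.append(da_step(ps[-1], t))
    return ps

def max_violation(t_max=12):
    """Largest value of LHS - RHS in (4) over 1 <= t <= t + n <= t_max."""
    ps = iterates(t_max)
    d = [kl(p, PI) for p in ps]
    return max(kl(ps[t], ps[t + n]) - (d[t] - d[t + n])
               for t in range(1, t_max + 1) for n in range(t_max - t + 1))
```

If (4) holds, `max_violation()` should not exceed zero by more than rounding error.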

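Proof: Let us use induction on $n$. The case $n = 0$ is
trivial. Suppose (4) has been verified for all $n' < n$. When $n$
is even, we apply (2), the induction hypothesis, and Lemma
3.1 to obtain

$$D(p^{(t)}|p^{(t+n)}) \le D(p^{(t)}|p^{(t+n-1)}) \le D(p^{(t)}|\pi) - D(p^{(t+n-1)}|\pi) \le D(p^{(t)}|\pi) - D(p^{(t+n)}|\pi).$$

When $n$ is odd, by (3), the induction hypothesis, and then
Lemma 3.1 we have

$$D(p^{(t)}|p^{(t+n)}) = D(p^{(t)}|p^{(t+1)}) + D(p^{(t+1)}|p^{(t+n)}) \le D(p^{(t)}|p^{(t+1)}) + D(p^{(t+1)}|\pi) - D(p^{(t+n)}|\pi) = D(p^{(t)}|\pi) - D(p^{(t+n)}|\pi).$$
■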
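Corollary 3.1: There exists some density $\pi^*$ such that

$$\lim_{t\to\infty} V(p^{(t)}, \pi^*) = 0.$$

Proof: Pinsker's inequality and (4) imply

$$\frac{1}{2} V^2(p^{(t)}, p^{(k)}) \le \left| D(p^{(t)}|\pi) - D(p^{(k)}|\pi) \right|$$

for all $t, k \ge 1$. Because $D(p^{(t)}|\pi)$ is finite and decreases
monotonically in $t$, $\lim_{t,k\to\infty} V(p^{(t)}, p^{(k)}) = 0$, i.e., $p^{(t)}$ is
a Cauchy sequence in $L_1(\mathcal{X} \times \mathcal{Y})$. Hence $p^{(t)}$ converges in
$L_1(\mathcal{X} \times \mathcal{Y})$ to some density $\pi^*$. (Only the completeness of
$L_1(\mathcal{X} \times \mathcal{Y})$ is used here. Further properties of $L_p$ spaces can
be found in real analysis texts such as Royden [16].) ■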
Proposition 3.1: In the setting of Corollary 3.1, $\pi^* = \pi$.
Proof: Since $p^{(t)}$, $t \ge 1$, has the conditional $\pi_{X|Y}$ when
$t$ is odd, and $\pi_{Y|X}$ when $t$ is even, the conditionals of $\pi^*$ must
match those of $\pi$, i.e.,

$$\pi^*(x, y) = \pi^*_Y(y)\, \pi_{X|Y}(x|y) = \pi^*_X(x)\, \pi_{Y|X}(y|x), \qquad (5)$$

almost everywhere. Under the assumption $\pi(x, y) > 0$, (5)
implies

$$\pi^*_Y(y) = \pi_Y(y)\, \frac{\pi^*_X(x)}{\pi_X(x)}.$$

Integration over $y$ yields $1 = \pi^*_X(x)/\pi_X(x)$, which, together
with (5), proves $\pi^* = \pi$. ■
Finally we finish the proof of Theorem 2.1 by showing that
the convergence in Corollary 3.1 also holds in relative entropy.



Lemma 3.4: $\lim_{t\to\infty} D(p^{(t)}|\pi) = 0$.

Proof: We already have $D(p^{(t)}|\pi) \downarrow d$, say, with $d \ge 0$.
Taking $n \to \infty$ in (4) we get

$$\liminf_{n\to\infty} D(p^{(t)}|p^{(t+n)}) \le D(p^{(t)}|\pi) - d.$$

On the other hand, since

$$D(p^{(t)}|p^{(t+n)}) = \int \left[ p^{(t)} \log\left(p^{(t)}/p^{(t+n)}\right) - p^{(t)} + p^{(t+n)} \right]$$

and the integrand is non-negative, by Fatou's Lemma we have

$$\liminf_{n\to\infty} D(p^{(t)}|p^{(t+n)}) \ge D(p^{(t)}|\pi), \qquad (6)$$

which forces $d = 0$. The proof is now complete. Note that (6)
is a case of the more general lower semi-continuity property
of relative entropy (Csiszar [4]). ■

IV. Remarks 

As pointed out by an anonymous reviewer, the
core of Section III consists of two parts: (i) showing
$\lim_{t\to\infty} V(p^{(t)}, \pi^*) = 0$ for some $\pi^*$, whose conditionals
match those of $\pi$, and (ii) showing that $\pi^* = \pi$. Part (i)
can be phrased more generally and is related to the results
of Csiszar and Shields ([6], Theorem 5.1) on alternating
I-projections. It is also related to the information theoretic
treatment of the EM algorithm ([8], [14]) of Csiszar and
Tusnady [7]. The condition $\pi(x, y) > 0$, not used in part (i),
can be replaced by a weaker assumption, as long as one can
show that there exists no density other than $\pi$ that possesses
the two conditionals $\pi_{X|Y}$ and $\pi_{Y|X}$.

Lemma 3.1 appears in Yu [21]. Lemmas 3.2 and 3.3 are
new. Generalizations of Theorem 2.1 to the Gibbs sampler with
more than two components are possible ([21]), but technically
more involved, because Lemmas 3.2 and 3.3 are tailored to
the two component case. The issue of the rate of convergence,
not addressed here, is definitely worth investigating. 

The DA algorithm has the following feature. If we let
$\{(X^{(t)}, Y^{(t)})\}$ be the iterates generated, i.e.,
the conditional distribution of $Y^{(t)}|X^{(t)}$ is $\pi_{Y|X}$ and that
of $X^{(t+1)}|Y^{(t)}$ is $\pi_{X|Y}$, then each of $\{X^{(t)}\}$ and $\{Y^{(t)}\}$
forms a reversible Markov chain. Fritz [9], Barron [2], and
Harremoës and Holst [12] apply information theory to prove
convergence theorems for reversible Markov chains. Their
results may be adapted to give an alternative (albeit less
elementary) derivation of Theorem 2.1.